ABSTRACT

A study investigated the validity of an English listening skills
test by comparing the results of native American and
British English speakers with those of Dutch students of English as a 
second language. A hypothesis suggested that two-thirds of the items 
would test listening skills and the remaining third would test other 
knowledge. Test results were analyzed according to both classical 
test theory and the Rasch item response theory. The findings showed 
the test to be disappointing as a measure of listening comprehension 
skills, but did suggest that the language listening ability of first- 
and second-language learners can be measured along a single variable 
that can be distinguished from age and educational background. 
(MSE)
LISTENING, A SINGLE TRAIT IN FIRST AND SECOND LANGUAGE LEARNING.
John H.A.L. de Jong

National Institute of Educational Measurement. 
Cito, Arnhem. 



Introduction 



In applied linguistics the pendulum regularly swings from
theories based on a clear distinction between first and second
language learning to theories stressing the similarities in
both processes of language acquisition. The contrastive
analysis hypothesis (Lado, 1957) relates learner difficulty to
differences between target and native language. Language
transfer in this theory is a dominant force in foreign language
learning. In parallel with Chomsky's (1959, 1968) rejection of
the structuralistic and behaviourist approach to language
acquisition, the contrastive analysis hypothesis in its strong
form proved to be untenable and theories on language learning
have focused on understanding the principles of first language
[illegible] applicability to foreign language [illegible]
[illegible] development in Corder's publications reflects
this shift in attention (Corder, 1981). Krashen formulated a
[illegible] language acquisition (Krashen,
1981, 1982; Burt et al., 1982) in which the natural order
hypothesis is clearly related to Chomsky's concept of an innate
[illegible] (Chomsky, 1981). But Krashen's attempt to
[illegible] framework from a number of widely accepted
ideas on second language acquisition lacks sufficient
foundation (McLaughlin, 1978; Gregg, 1984; Corder, 1984) and
[illegible] receiving renewed interest from applied
linguists (Kellerman, 1983; Schachter, 1983). The swinging of
the pendulum, however, causes the hands of the clock to move
[illegible] that language universals serve as an
overall guiding principle in second language acquisition,
interacting with the systems in the native and in the target
language, thus combining both principles.

In language testing there is a controversy between advocates of
discrete point testing and integrative testing. In the early
years of testing the stress put on the necessity to break down
language competence into different skills and even constituent
[illegible] structuralist approach
to language learning. In the "Post Modern Phase" (Spolsky,
1984) more attention has been given to the testing of language
use in 'authentic' situations, testing communicative
[illegible] ([illegible] 1978; [illegible] 1983;
Morrow, 1978), and at the same time holistic (Conlan, 1983) and
[illegible] into
[illegible] language acquisition are related, then the
difficulties in processing samples of a language should be
similar for foreign language learners and for native speakers.
This study is an attempt to prove that a single trait underlies
the performance of native and non-native speakers on a
listening comprehension test. The power of a test to measure
differences in ability amongst individuals or groups of
individuals depends on the homogeneity and validity of the set
of items contained in the test and on the level of difficulty of
the items in relation to the ability of the persons to be
measured. The hypothesis to be tested then is: a set of items
that discriminates amongst non-native speakers with respect to
their level of ability in performing a particular foreign
language task will discriminate on the same trait amongst
native speakers of that language provided that the test is not
too easy for them. De Jong (1983) demonstrated a procedure for
making a best selection of items by means of a series of Rasch
analyses. From a listening comprehension test badly fitting
items with low discrimination indexes were deleted in each
subsequent analysis. It was concluded that a selection of two
thirds of the items in the test constituted a valid measure for
listening comprehension of English as a foreign language. The
remaining part was thought to test a different ability,
possibly general intelligence or knowledge of the world. If the
hypothesis is not rejected the same selection of items will
discriminate consistently between native speakers differing in
age and/or educational background and consequently differing in
command of their mother tongue. A test composed of the rejected
items, however, will reveal a different relation between the
native speakers concerned, as this test taps a different trait.



Method 



The test used in this study was constructed as a pilot test of
[illegible] English as a foreign language at the
Dutch National Institute for Educational Measurement (Cito,
Arnhem) in a research project designed to develop new methods
of testing listening comprehension (De Jong and Van den
Nieuwenhof, 1982; De Jong, 1984). The test uses live recordings
taken from British and American radio programmes cut into
samples of about 20 seconds each. Testees listen to the tape
once and have to respond to a multiple choice question with two
options printed in a test booklet within the 10 second pause in
between samples provided on the tape. Two item formats were
used: true-false items (was the statement in the test booklet
in accordance with what was said on the tape?), and modified
cloze items: words to be deleted from the text were chosen for
their semantic relevance in the context. In each sample one
word - or group of words - was cut out from the tape and
replaced by an electronic sound. Testees had to decide which
of the two options presented in their test booklet could be
used to restore the text. The test in this study contained
three types of language use: a discussion, a telephone
conversation and a regular news programme. Total test length
[illegible] the foreign language at
[illegible] (De Jong [illegible]) educational
[illegible] administered to three
groups of subjects: [illegible]
graduating at American High School level;
[illegible] target population:
[illegible] years old, in their final year at Dutch
[illegible] preparing for their examinations which allow
[illegible] taken from
[illegible]
[illegible] Rasch model (Rasch, 1960) was chosen. Parameter
[illegible] test performance
and unobservable traits or abilities assumed to underlie
[illegible]
the same ability the differences in [illegible] measure
result in the same difference in probability [illegible] these persons
of getting an item right [illegible]
the differences in [illegible] ability of groups
[illegible] equal differences in probability [illegible]
each and every item in the test.
[illegible] unconditional maximum likelihood
procedure (UCON) of the programme [illegible] was used for [illegible]
gathered from the samples from the target population [illegible]
PROX-procedure (Wright and Stone, 1979) [illegible]
item and person parameters from the data of all [illegible]
randomly to 30 subjects per group to rule out [illegible]
sample size and to compute [illegible]
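
The invariance property that motivates the choice of the Rasch model can be illustrated with a minimal sketch. It assumes the standard dichotomous Rasch form, in which the probability of a correct response depends only on the difference between person ability and item difficulty (both in logits). The function names and all numerical values are hypothetical and are not taken from the study. Note that the constant gap between two persons holds on the log-odds scale, not on the raw probability scale.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical values in logits, chosen only for illustration.
abilities = {"person_a": 1.5, "person_b": 0.5}
difficulties = [-1.0, 0.0, 0.8, 2.0]

# For every item, the log-odds gap between the two persons equals
# the gap between their abilities (1.0 logit here), whatever the
# difficulty of the item.
gaps = [
    log_odds(rasch_probability(abilities["person_a"], d))
    - log_odds(rasch_probability(abilities["person_b"], d))
    for d in difficulties
]
print([round(g, 6) for g in gaps])
```

If a set of items fits the model, the same reasoning applies to groups: the difference in mean ability between two groups should show up as the same log-odds difference on each item.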



Results 



Table 1 presents mean scores and standard deviation [illegible]
in proportion of test length and reliability [illegible]
total listening comprehension test as observed for the three
different groups.



Table 1 Results of native speakers (1 and 2) and L2 learners
(3) on total test (n = 59)

Group        1      2      3
N            30     44     575
mean p       .85    .76    .72
S.D.         .05    .08    .09
KR20         .45    .56    .60
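
The KR20 coefficient reported in Table 1 can be computed from a persons-by-items matrix of 0/1 scores. The sketch below assumes the usual Kuder-Richardson formula 20 with the population variance of total scores; the function name and the toy data are hypothetical and do not reproduce the responses collected in this study.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a persons x items matrix of 0/1 scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores do not vary; KR20 is undefined")
    # Sum over items of p * q, the variance of each dichotomous item.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_persons
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Hypothetical toy data: 5 persons, 4 items.
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(toy_scores), 2))  # 0.8
```

The coefficient becomes negative when the summed item variances exceed the variance of the total scores, which can happen when scores hardly vary within a group.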



For a test of foreign language listening comprehension these
results are rather disappointing, for a number of reasons.
Reliability in the target group (3) is low; even at the
standard length for Dutch National listening comprehension
tests (75 items with two options or 50 items with three
options) Spearman-Brown prediction of reliability is
unacceptable (.66). Results of the second native speaker group
(2) hardly differ from those in the target group (3). In fact
the hypothesis that they are taken from the same population
cannot be rejected (Mann-Whitney test: p = .5). Group 1 differs
significantly from both other groups (p < .001), but for a
group of native speakers (of comparable age and educational
background as the target population) a near perfect mean score
with negligible variation is to be expected if the test
measures language only and at the appropriate level. According
to the assumption of unidimensionality in the Rasch model,
calibrations of item difficulties are population independent
and therefore invariant across different groups (Hambleton and
Murray, 1983). This assumption was confirmed by a correlation
of .97 between UCON item calibrations calculated from two
different subgroups from group 3 distinguished according to
year of graduation (N1 = 300, N2 = 275). Correlation between
PROX calibration (N = 30) and UCON calibration (N = 575) was
.92. Low correlations (± .40) of item calibrations based on the
responses in the different groups show that across groups the
assumption is not confirmed. Obviously item difficulty ranking
differs from group to group. However, correlation of item
calibrations in two subgroups of the target group (year 1 and
year 2) is high (.97) which suggests some kind of bias in the
test. This bias would seem due to age (group 2 is younger than
groups 1 and 3) and/or native language (groups 1 and 2: L1, group
3: L2).
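
The Spearman-Brown prediction quoted above can be checked with a short sketch. It assumes the standard Spearman-Brown prophecy formula and takes as input the KR20 of the target group on the 59-item test (.60, Table 1) and the standard length of 75 two-option items; the function name is hypothetical.

```python
def spearman_brown(reliability, old_length, new_length):
    """Predicted reliability when a test is lengthened with comparable items."""
    factor = new_length / old_length
    return factor * reliability / (1 + (factor - 1) * reliability)

# Target group (3): KR20 = .60 on 59 items, projected to 75 items.
predicted = spearman_brown(0.60, 59, 75)
print(round(predicted, 2))  # 0.66
```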

Also, if all items measure the same trait, all items should
rank individuals (or groups of individuals) in the same way.
Figure 1 shows that most items (88 percent) in the test
consistently rank group 1 as the most able group. About two
thirds rank group 2 higher than the second language learners
(3), but about one third results in a higher ranking for the
second language learners.



Figure 1 Ranking of groups of native speakers and L2 learners

[Bar chart: number of items (0 to 59) on which group 1, group 2
and group 3 rank 1st, 2nd and 3rd]



Rasch analysis (PROX-procedure) showed misfit (p < .05) to be
unevenly distributed in the groups: misfit occurred mostly in
the second native speaker group and in the group of second
language learners.



Table 2 Mean and standard deviation of fit statistics (z²)
total test (n = 59; N = 30)

          Group 1   Group 2   Group 3   Mean (1+2+3)
Mean      .86       1.06      1.27      1.06
S.D.      .97       1.35      1.70      .79



All items were checked for significant bias - revealed by
misfit - favouring any single group or combination of two
groups: six categories in all. The data were set up in a 2 x 6
contingency table against the items selected or rejected in the
previous study (De Jong, 1983). A significant relation was
found between the items favouring native speakers in this study
and the items constituting the best selection in the previous
study (χ² = 31, df = 5, p < .005), thus confirming the
conclusion that two subsets of items can be distinguished each
measuring a different trait. Results of the three groups in the
present study on these subsets are presented in Table 3.
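
The chi-square test on the 2 x 6 contingency table can be sketched as follows. The cell counts of the actual table are not reported, so the counts below are invented purely for illustration; the function name is hypothetical as well. The sketch mainly shows why a 2 x 6 table has 5 degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts, NOT the data of this study.
# Rows: items selected / rejected in the earlier study.
# Columns: the six bias categories.
example = [
    [10, 8, 6, 3, 2, 1],
    [1, 2, 3, 6, 8, 10],
]
statistic, df = chi_square(example)
print(df)  # 5
```

With 5 degrees of freedom the critical value at the .005 level is about 16.75, so the reported statistic of 31 lies well beyond it.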



Table 3 Results of native speakers (1 and 2) and L2 learners
(3) on two subsets of items

Group                  1      2      3

40 'best' items
  Mean p               .95    .87    .79
  S.D.                 .03    .07    .10
  KR 20               -.15    .48    .67

19 rejected items
  Mean p               .63    .54    .58
  S.D.                 .11    .1?    .11
  KR 20               -.10    .05   -.04



The selection of 40 'best' items clearly distinguishes between
the three groups (Mann-Whitney test: p < .0001) and
establishes the order, from high to low, group 1 - group 2 -
group 3. Group 1 obtains a near perfect score and no
significant variance in ability can be measured at this level
amongst the individuals in this group. For the second native
speaker group the selection is too easy to establish reliable
differences between individuals within the group, but
significant variation in scores can be observed (p < .01). In
group 3, the second language learners, the test measures
differences in ability best. Spearman-Brown prediction for
reliability at standard length is acceptable (.81), mean score
is just above the ideal: midway between chance score (.5) and
perfect score.

The 19 rejected items subset distinguishes less well between
groups 1 and 3 (Mann-Whitney test: .01 < p < .05) and also
between groups 2 and 3 (p < .01) but significant difference is
observed between groups 1 and 2 (Mann-Whitney test: p < .005).
The order of groups 2 and 3 is reversed and group 1 remains the
group with the highest scores, which suggests that difference
in language ability leading to the ranking of the groups in the
40 item selection is overruled in these 19 items [illegible]
[illegible] to the two
[illegible] hypothesis that the trait
underlying performance [illegible] (Table 4).
In the [illegible]
items test the same [illegible]
level of ability between the [illegible] differences in
[illegible] items fit the [illegible]
deviating items and fit statistics [illegible] relatively high
considering the test is [illegible] the length of the 40
item selection.



Table 4 Mean and standard deviation of fit statistics (z²) on
two subsets of items

                     Group 1   Group 2   Group 3   Mean 1+2+3

40 'best' items
  Mean / S.D.        [partly illegible; surviving values: .63 .98 .76]

19 rejected items
  Mean / S.D.        [partly illegible; surviving values: .67 .80 .74]



[illegible] the 40-item subset [illegible] between all
groups [illegible]
hypothesis of unidimensionality of the [illegible]
Rank ordering is lower in the 19 item subset ([illegible] from
[illegible] .79) but remains significant, suggesting that [illegible]
proportioned in all [illegible]
the estimated ability of the groups in the [illegible]
indication of the distribution of ability [illegible]
subjects from the [illegible]
on the responses of the three groups distinguished in this
study, is apparent. From the estimated means and standard
deviations of ability in the three groups it is clear that,
though no significant difference can be measured with this test
between the second group of native speakers (2) and the group
from the target population (3), group 1 stands well apart from
the two other groups. Estimated variation of ability within all
three groups is low: less than one logit from the 16th to the
84th percentile.
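
A test characteristic curve of the kind discussed here can be sketched as the expected proportion of correct responses at each ability level, averaged over the items of the test. The sketch assumes the dichotomous Rasch model; the item difficulties and the ability grid are hypothetical and are not the calibrations obtained in this study.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def characteristic_curve(abilities, difficulties):
    """Expected proportion correct on the whole test at each ability level."""
    return [
        sum(rasch_probability(b, d) for d in difficulties) / len(difficulties)
        for b in abilities
    ]

# Hypothetical item difficulties in logits, symmetric around zero.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
# Ability grid from -3 to +3 logits in steps of 0.5.
grid = [-3.0 + 0.5 * k for k in range(13)]
curve = characteristic_curve(grid, difficulties)

for ability, expected in zip(grid, curve):
    print(f"{ability:+.1f}  {expected:.3f}")
```

Reading a group's mean ability against such a curve gives its expected proportion score. If ability is roughly normally distributed within a group, the 16th and 84th percentiles lie about one standard deviation either side of the mean, so a spread of less than one logit between them corresponds to a standard deviation of less than about half a logit.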



Figure 2 Test characteristic curve and distribution of ability
for the total test (n = 59)

[Plot: test characteristic curves estimated with PROX and UCON;
horizontal axis: ability; markers: mean ability and standard
deviation in each group]



Figures 3 and 4 present TCCs for the two different subsets of
items from the total test. Both figures are on the same scale
as figure 2. Figure 3 shows that the 40 'best' item selection
leads to an estimate of larger differences in mean ability
between all three groups than the total test. However, a large
amount of overlap exists between the target population (3) and
the second group of native speakers (2). Because of a ceiling
effect the test has no power to measure significant variation
in ability amongst individuals of group 1. In the other groups
[illegible] difference in ability between the
16th and 84th percentile.

Figure 3 Test characteristic curve and distribution of ability
for subset of 'best' items (n = 40)

Figure 4 Test characteristic curve and distribution of ability
for subset of rejected items (n = 19)

The 19 rejected items (fig. 4) measure less than one half logit
difference in ability between the means of the group lowest in
mean ability (which, on this subset, is group 2) and the group
of the highest mean ability (1). Obviously there is no ceiling
effect and in spite of observed scores at near chance level
there is no indication of a floor effect either; guessing does
not seem to have taken place. Substantial overlap between all
three groups and a difference of about one logit between the
16th and 84th percentile suggest that the trait underlying this
subset does discriminate but not according to any assumed
difference in understanding English. Whatever trait the test
measures, it is altogether different from the trait underlying
the 40 'best' item selection as is indicated by the reversed
position of groups 2 and 3 as well as by the absence of
correlation between scores of the target group on both subsets
(r_pm = .03; N = 575).



Discussion 

In a previous study (De Jong, 1983) it was concluded that a
subset of 40 items from the 59 item listening comprehension
test constitutes a valid measure for listening comprehension of
English as a foreign language whereas 19 items had to be
rejected because they measure a different ability, possibly to
be identified as general intelligence or knowledge of the
world. The present study demonstrates that the same selection
of 40 items discriminates between two groups of native speakers
differing in age and educational background and estimates
significant variance in ability within the group with lower
mean score. Of course L1 learners do not all achieve equally
well on tests of their native language - there would be no need
for L1 classes otherwise. The results of this study however
suggest that language listening ability of L1 and L2 learners
can be measured along a single variable and that this ability
can be distinguished from an age- and school-tied variable,
which could be general intelligence and/or knowledge of the
world.

The groups used in this research are small. However, Wright
(1977) states that satisfactory calibrations can be achieved
with tests of more than 20 items on samples of about 100
persons. Moreover, Wright and Stone (1979) successfully used a
test of only 14 items on a sample of 34 subjects to demonstrate
test analysis with the Rasch model (cf. also Lord, 1983).
Wright and Stone (1979) have shown the conformity of analyses
done by hand with the PROX-procedure and computer analyses with
UCON. Because of significant correlation between calibrations
of items on two subgroups of the target population and between
calibrations on the three groups in this study, guessing cannot
have seriously influenced results and calibrations apparently
suffer little from error due to the small size of the groups.
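
To indicate what a hand calculation with the PROX-procedure involves, a minimal sketch follows. It assumes the normal approximation formulas as they are commonly presented for PROX: item and person logits are computed from raw scores and then expanded by factors based on the variances of both sets of logits, using the constants 2.89 and 8.35. The function name and the toy data are hypothetical; this is not the programme used in the study.

```python
import math

def prox(responses):
    """PROX estimates (item difficulties, person abilities) in logits.

    responses: persons x items matrix of 0/1 scores. Persons and items
    with zero or perfect scores must be removed beforehand.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    item_scores = [sum(row[i] for row in responses) for i in range(n_items)]
    person_scores = [sum(row) for row in responses]
    if any(s in (0, n_persons) for s in item_scores) or any(
        r in (0, n_items) for r in person_scores
    ):
        raise ValueError("remove zero and perfect scores first")

    # Logit of proportion incorrect per item, proportion correct per person.
    item_logits = [math.log((n_persons - s) / s) for s in item_scores]
    person_logits = [math.log(r / (n_items - r)) for r in person_scores]

    item_mean = sum(item_logits) / n_items
    person_mean = sum(person_logits) / n_persons
    u = sum((x - item_mean) ** 2 for x in item_logits) / (n_items - 1)
    v = sum((y - person_mean) ** 2 for y in person_logits) / (n_persons - 1)

    denominator = 1 - u * v / 8.35
    if denominator <= 0:
        raise ValueError("variances too large for the normal approximation")
    person_expansion = math.sqrt((1 + u / 2.89) / denominator)
    item_expansion = math.sqrt((1 + v / 2.89) / denominator)

    # Item difficulties are centred on zero; abilities are on the same scale.
    difficulties = [item_expansion * (x - item_mean) for x in item_logits]
    abilities = [person_expansion * y for y in person_logits]
    return difficulties, abilities

# Hypothetical toy data: 5 persons, 4 items.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]
difficulties, abilities = prox(toy)
print([round(d, 2) for d in difficulties])
print([round(b, 2) for b in abilities])
```

With samples as small as the toy example the estimates are of course very imprecise; the sketch only shows the order of the computations.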
The short distance along the variable between Dutch students at
the pre-university level and native speakers of English may be
surprising at first sight. However, the level of Dutch foreign
language learners of English appears to be rather high as is
clear from results on TOEFL (Test Of English As a Foreign
Language) too. Clark (1977) found a mean raw score of 134.7 for
native American High School college-bound seniors,
corresponding to a scaled score of 610 (maximum 680), whereas
for native Dutch L1 speakers the mean scaled score was reported
to be 584 from July 1980 to June 1982, well above the mean
scaled score for all participants of 503 (TOEFL, 1983).
The results reported here agree with earlier findings (Fishman,
1980; Carrel, 1980; Wilson, 1980). In the Fishman (1980) study
difficulty level was artificially enhanced by adding white
noise to a dictation task whereas in this study conditions were
the same for all groups. Carrel (1980) studied the processing
of indirectly conveyed meaning by a group of young children,
native speakers of English, and adults acquiring English as a
second language: a much larger difference in development than
the one between groups 2 and 3 in this study. Wilson (1980)
could not detect first language interference even with tests
purposely biased against L2 learners with elements predicted by
contrastive analysis as difficult for L2 learners with a
certain L1 background.

However, there is a large amount of literature revealing first
language interference and language transfer (cf. Gass, 1984).
Most of these studies test the hypothesis of L1 interference
with discrete point tests tapping productive skills (e.g.
Schachter, 1974; Zobl, 1982, 1983; Bourgonje et al., 1984; Van
Buren and Sharwood-Smith, 1984; Van Hest et al., 1984). Possibly,
universals and language transfer operate at the receptive and
productive level respectively and a combination of both
principles is necessary to account for language acquisition.
The claim that language teaching should begin with the
receptive skills (e.g. Postovsky, 1974; Benson and Hjelt, 1980)
would be consistent with Gass' suggestion (1984) that language
universals serve as the overall guiding principle in language
acquisition.

The present study uses an integrative test of auditory -
receptive - language processing to reveal listening
comprehension as a single trait for L1 and L2 learners.
Language tests inevitably measure language ability on manifest
behaviour: at the performance level. At the competence level it
may be possible to describe language production as the reversed
process of language reception. At the performance level,
however, production appears to be more sensitive to language
transfer than language reception is. Whether this phenomenon
constitutes an intrinsic distinction between production and
reception at the performance level or is only due to the fact
that receptive ability - both in L1 and in L2 - is generally
more developed than productive ability, remains open to further
investigation.